{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BeautifulSoup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 解析库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| 解析器\t| 使用方法\t| 优势\t| 劣势 |\n",
    "|--| -- |-- |-- |\n",
    "| Python标准库 |\tBeautifulSoup(markup, \"html.parser\")\t| Python的内置标准库、执行速度适中 、文档容错能力强 | Python 2.7.3 or 3.2.2)前的版本中文容错能力差|\n",
    "| lxml HTML 解析器\t| BeautifulSoup(markup, \"lxml\")\t| 速度快、文档容错能力强 | 需要安装C语言库 |\n",
    "| lxml XML 解析器\t| BeautifulSoup(markup, \"xml\") | 速度快、唯一支持XML的解析器 | 需要安装C语言库 |\n",
    "| html5lib\t| BeautifulSoup(markup, \"html5lib\")\t | 最好的容错性、以浏览器的方式解析文档、生成HTML5格式的文档 | 速度慢、不依赖外部扩展 |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 基本使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<html>\n",
      " <head>\n",
      "  <title>\n",
      "   The Dormouse's story\n",
      "  </title>\n",
      " </head>\n",
      " <body>\n",
      "  <p class=\"title\" name=\"dromouse\">\n",
      "   <b>\n",
      "    The Dormouse's story\n",
      "   </b>\n",
      "  </p>\n",
      "  <p class=\"story\">\n",
      "   Once upon a time there were three little sisters; and their names were\n",
      "   <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "    <!-- Elsie -->\n",
      "   </a>\n",
      "   ,\n",
      "   <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">\n",
      "    Lacie\n",
      "   </a>\n",
      "   and\n",
      "   <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">\n",
      "    Tillie\n",
      "   </a>\n",
      "   ;\n",
      "and they lived at the bottom of a well.\n",
      "  </p>\n",
      "  <p class=\"story\">\n",
      "   ...\n",
      "  </p>\n",
      " </body>\n",
      "</html>\n",
      "The Dormouse's story\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"><!-- Elsie --></a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.prettify())\n",
    "print(soup.title.string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 标签选择器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 选择元素"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<title>The Dormouse's story</title>\n",
      "<class 'bs4.element.Tag'>\n",
      "<head><title>The Dormouse's story</title></head>\n",
      "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"><!-- Elsie --></a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.title)\n",
    "print(type(soup.title))\n",
    "print(soup.head)\n",
    "print(soup.p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 获取名称"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "title\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"><!-- Elsie --></a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.title.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 获取属性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dromouse\n",
      "dromouse\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"><!-- Elsie --></a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.p.attrs['name'])\n",
    "print(soup.p['name'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 获取内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Dormouse's story\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p clss=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"><!-- Elsie --></a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.p.string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 嵌套选择"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Dormouse's story\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"><!-- Elsie --></a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.head.title.string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 子节点和子孙节点"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['\\n            Once upon a time there were three little sisters; and their names were\\n            ', <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>, '\\n', <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>, ' \\n            and\\n            ', <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>, '\\n            and they lived at the bottom of a well.\\n        ']\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html>\n",
    "    <head>\n",
    "        <title>The Dormouse's story</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <p class=\"story\">\n",
    "            Once upon a time there were three little sisters; and their names were\n",
    "            <a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">\n",
    "                <span>Elsie</span>\n",
    "            </a>\n",
    "            <a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> \n",
    "            and\n",
    "            <a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>\n",
    "            and they lived at the bottom of a well.\n",
    "        </p>\n",
    "        <p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.p.contents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<list_iterator object at 0x1064f7dd8>\n",
      "0 \n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            \n",
      "1 <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "2 \n",
      "\n",
      "3 <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>\n",
      "4  \n",
      "            and\n",
      "            \n",
      "5 <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "6 \n",
      "            and they lived at the bottom of a well.\n",
      "        \n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html>\n",
    "    <head>\n",
    "        <title>The Dormouse's story</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <p class=\"story\">\n",
    "            Once upon a time there were three little sisters; and their names were\n",
    "            <a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">\n",
    "                <span>Elsie</span>\n",
    "            </a>\n",
    "            <a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> \n",
    "            and\n",
    "            <a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>\n",
    "            and they lived at the bottom of a well.\n",
    "        </p>\n",
    "        <p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.p.children)\n",
    "for i, child in enumerate(soup.p.children):\n",
    "    print(i, child)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<generator object descendants at 0x10650e678>\n",
      "0 \n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            \n",
      "1 <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "2 \n",
      "\n",
      "3 <span>Elsie</span>\n",
      "4 Elsie\n",
      "5 \n",
      "\n",
      "6 \n",
      "\n",
      "7 <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>\n",
      "8 Lacie\n",
      "9  \n",
      "            and\n",
      "            \n",
      "10 <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "11 Tillie\n",
      "12 \n",
      "            and they lived at the bottom of a well.\n",
      "        \n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html>\n",
    "    <head>\n",
    "        <title>The Dormouse's story</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <p class=\"story\">\n",
    "            Once upon a time there were three little sisters; and their names were\n",
    "            <a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">\n",
    "                <span>Elsie</span>\n",
    "            </a>\n",
    "            <a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> \n",
    "            and\n",
    "            <a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>\n",
    "            and they lived at the bottom of a well.\n",
    "        </p>\n",
    "        <p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.p.descendants)\n",
    "for i, child in enumerate(soup.p.descendants):\n",
    "    print(i, child)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 父节点和祖先节点"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<p class=\"story\">\n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a> \n",
      "            and\n",
      "            <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "            and they lived at the bottom of a well.\n",
      "        </p>\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html>\n",
    "    <head>\n",
    "        <title>The Dormouse's story</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <p class=\"story\">\n",
    "            Once upon a time there were three little sisters; and their names were\n",
    "            <a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">\n",
    "                <span>Elsie</span>\n",
    "            </a>\n",
    "            <a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> \n",
    "            and\n",
    "            <a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>\n",
    "            and they lived at the bottom of a well.\n",
    "        </p>\n",
    "        <p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.a.parent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, <p class=\"story\">\n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a> \n",
      "            and\n",
      "            <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "            and they lived at the bottom of a well.\n",
      "        </p>), (1, <body>\n",
      "<p class=\"story\">\n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a> \n",
      "            and\n",
      "            <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "            and they lived at the bottom of a well.\n",
      "        </p>\n",
      "<p class=\"story\">...</p>\n",
      "</body>), (2, <html>\n",
      "<head>\n",
      "<title>The Dormouse's story</title>\n",
      "</head>\n",
      "<body>\n",
      "<p class=\"story\">\n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a> \n",
      "            and\n",
      "            <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "            and they lived at the bottom of a well.\n",
      "        </p>\n",
      "<p class=\"story\">...</p>\n",
      "</body></html>), (3, <html>\n",
      "<head>\n",
      "<title>The Dormouse's story</title>\n",
      "</head>\n",
      "<body>\n",
      "<p class=\"story\">\n",
      "            Once upon a time there were three little sisters; and their names were\n",
      "            <a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">\n",
      "<span>Elsie</span>\n",
      "</a>\n",
      "<a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a> \n",
      "            and\n",
      "            <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "            and they lived at the bottom of a well.\n",
      "        </p>\n",
      "<p class=\"story\">...</p>\n",
      "</body></html>)]\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html>\n",
    "    <head>\n",
    "        <title>The Dormouse's story</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <p class=\"story\">\n",
    "            Once upon a time there were three little sisters; and their names were\n",
    "            <a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">\n",
    "                <span>Elsie</span>\n",
    "            </a>\n",
    "            <a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> \n",
    "            and\n",
    "            <a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>\n",
    "            and they lived at the bottom of a well.\n",
    "        </p>\n",
    "        <p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(list(enumerate(soup.a.parents)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 兄弟节点"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, '\\n'), (1, <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>), (2, ' \\n            and\\n            '), (3, <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>), (4, '\\n            and they lived at the bottom of a well.\\n        ')]\n",
      "[(0, '\\n            Once upon a time there were three little sisters; and their names were\\n            ')]\n"
     ]
    }
   ],
   "source": [
    "html = \"\"\"\n",
    "<html>\n",
    "    <head>\n",
    "        <title>The Dormouse's story</title>\n",
    "    </head>\n",
    "    <body>\n",
    "        <p class=\"story\">\n",
    "            Once upon a time there were three little sisters; and their names were\n",
    "            <a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">\n",
    "                <span>Elsie</span>\n",
    "            </a>\n",
    "            <a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> \n",
    "            and\n",
    "            <a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>\n",
    "            and they lived at the bottom of a well.\n",
    "        </p>\n",
    "        <p class=\"story\">...</p>\n",
    "\"\"\"\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(list(enumerate(soup.a.next_siblings)))\n",
    "print(list(enumerate(soup.a.previous_siblings)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 标准选择器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find_all( name , attrs , recursive , text , **kwargs )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可根据标签名、属性、内容查找文档"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<ul class=\"list\" id=\"list-1\">\n",
      "<li class=\"element\">Foo</li>\n",
      "<li class=\"element\">Bar</li>\n",
      "<li class=\"element\">Jay</li>\n",
      "</ul>, <ul class=\"list list-small\" id=\"list-2\">\n",
      "<li class=\"element\">Foo</li>\n",
      "<li class=\"element\">Bar</li>\n",
      "</ul>]\n",
      "<class 'bs4.element.Tag'>\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.find_all('ul'))\n",
    "print(type(soup.find_all('ul')[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>, <li class=\"element\">Jay</li>]\n",
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>]\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "for ul in soup.find_all('ul'):\n",
    "    print(ul.find_all('li'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### attrs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<ul class=\"list\" id=\"list-1\" name=\"elements\">\n",
      "<li class=\"element\">Foo</li>\n",
      "<li class=\"element\">Bar</li>\n",
      "<li class=\"element\">Jay</li>\n",
      "</ul>]\n",
      "[<ul class=\"list\" id=\"list-1\" name=\"elements\">\n",
      "<li class=\"element\">Foo</li>\n",
      "<li class=\"element\">Bar</li>\n",
      "<li class=\"element\">Jay</li>\n",
      "</ul>]\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\" name=\"elements\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.find_all(attrs={'id': 'list-1'}))\n",
    "print(soup.find_all(attrs={'name': 'elements'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<ul class=\"list\" id=\"list-1\">\n",
      "<li class=\"element\">Foo</li>\n",
      "<li class=\"element\">Bar</li>\n",
      "<li class=\"element\">Jay</li>\n",
      "</ul>]\n",
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>, <li class=\"element\">Jay</li>, <li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>]\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.find_all(id='list-1'))\n",
    "print(soup.find_all(class_='element'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Foo', 'Foo']\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.find_all(text='Foo'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find( name , attrs , recursive , text , **kwargs )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "find返回单个元素，find_all返回所有元素"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<ul class=\"list\" id=\"list-1\">\n",
      "<li class=\"element\">Foo</li>\n",
      "<li class=\"element\">Bar</li>\n",
      "<li class=\"element\">Jay</li>\n",
      "</ul>\n",
      "<class 'bs4.element.Tag'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.find('ul'))\n",
    "print(type(soup.find('ul')))\n",
    "print(soup.find('page'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find_parents()  find_parent()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "find_parents()返回所有祖先节点，find_parent()返回直接父节点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find_next_siblings()  find_next_sibling()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "find_next_siblings()返回后面所有兄弟节点，find_next_sibling()返回后面第一个兄弟节点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find_previous_siblings()  find_previous_sibling()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "find_previous_siblings()返回前面所有兄弟节点，find_previous_sibling()返回前面第一个兄弟节点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find_all_next()  find_next()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "find_all_next()返回节点后所有符合条件的节点, find_next()返回第一个符合条件的节点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### find_all_previous() 和 find_previous()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "find_all_previous()返回节点后所有符合条件的节点, find_previous()返回第一个符合条件的节点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CSS选择器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过select()直接传入CSS选择器即可完成选择"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<div class=\"panel-heading\">\n",
      "<h4>Hello</h4>\n",
      "</div>]\n",
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>, <li class=\"element\">Jay</li>, <li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>]\n",
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>]\n",
      "<class 'bs4.element.Tag'>\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "print(soup.select('.panel .panel-heading'))\n",
    "print(soup.select('ul li'))\n",
    "print(soup.select('#list-2 .element'))\n",
    "print(type(soup.select('ul')[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>, <li class=\"element\">Jay</li>]\n",
      "[<li class=\"element\">Foo</li>, <li class=\"element\">Bar</li>]\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "for ul in soup.select('ul'):\n",
    "    print(ul.select('li'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 获取属性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "list-1\n",
      "list-1\n",
      "list-2\n",
      "list-2\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "for ul in soup.select('ul'):\n",
    "    print(ul['id'])\n",
    "    print(ul.attrs['id'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 获取内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Foo\n",
      "Bar\n",
      "Jay\n",
      "Foo\n",
      "Bar\n"
     ]
    }
   ],
   "source": [
    "html='''\n",
    "<div class=\"panel\">\n",
    "    <div class=\"panel-heading\">\n",
    "        <h4>Hello</h4>\n",
    "    </div>\n",
    "    <div class=\"panel-body\">\n",
    "        <ul class=\"list\" id=\"list-1\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "            <li class=\"element\">Jay</li>\n",
    "        </ul>\n",
    "        <ul class=\"list list-small\" id=\"list-2\">\n",
    "            <li class=\"element\">Foo</li>\n",
    "            <li class=\"element\">Bar</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html, 'lxml')\n",
    "for li in soup.select('li'):\n",
    "    print(li.get_text())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 推荐使用lxml解析库，必要时使用html.parser\n",
    "* 标签选择筛选功能弱但是速度快\n",
    "* 建议使用find()、find_all() 查询匹配单个结果或者多个结果\n",
    "* 如果对CSS选择器熟悉建议使用select()\n",
    "* 记住常用的获取属性和文本值的方法"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
